High-throughput Ingest of Provenance Records into Accumulo

نویسندگان

  • Thomas Moyer
  • Vijay Gadepally
چکیده

Whole-system data provenance provides deep insight into the processing of data on a system, including detecting data integrity attacks. The downside to systems that collect whole-system data provenance is the sheer volume of data that is generated under many heavy workloads. In order to make provenance metadata useful, it must be stored somewhere where it can be queried. This problem becomes even more challenging when considering a network of provenance-aware machines all collecting this metadata. In this paper, we investigate the use of D4M and Accumulo to support high-throughput data ingest of whole-system provenance data. We find that we are able to ingest 3,970 graph components per second. Centrally storing the provenance metadata allows us to build systems that can detect and respond to data integrity attacks that are captured by the provenance system.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

PACE: Proactively-Secure Accumulo with Cryptographic Enforcement

Cloud-hosted databases have many compelling benefits, including high availability, flexible resource allocation, and resiliency to attack, but it requires that cloud tenants cede control of their data to the cloud provider. In this paper, we describe Proactively-secure Accumulo with Cryptographic Enforcement (PACE), a client-side library that cryptographically protects a tenant’s data, returnin...

متن کامل

The Case for Fine-Grained Stream Provenance

The current state of the art for provenance in data stream management systems (DSMS) is to provide provenance at a high level of abstraction (such as, from which sensors in a sensor network an aggregated value is derived from). This limitation was imposed by high-throughput requirements and an anticipated lack of application demand for more detailed provenance information. In this work, we firs...

متن کامل

Ingestion, Indexing and Retrieval of High-Velocity Multidimensional Sensor Data on a Single Node

Sources of multidimensional data are becoming more prevalent, partly due to the rise of the Internet of Things (IoT), and with that the need to ingest and analyze data streams at rates higher than before. Some industrial IoT applications require ingesting millions of records per second, while processing queries on recently ingested and historical data. Unfortunately, existing database systems s...

متن کامل

Provenance of High Throughput Biomedical Experiments

The field of translational biomedical informatics seeks to integrate knowledge from basic science, directed research into diseases, and clinical insights into a form that can be used to discover effective treatments of diseases. Currently, representations of experimental provenance reside in models specific to each sub-domain: biospecimen management tools track the histories of biospecimens, ho...

متن کامل

Managing the Deluge of Scientific Data

Provenance information in eScience is metadata that's critical to effectively manage the exponentially increasing volumes of scientific data from industrial-scale experiment protocols. Semantic provenance, based on domain-specific provenance ontologies, lets software applications unambiguously interpret data in the correct context. The semantic provenance framework for eScience data comprises e...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1608.03780  شماره 

صفحات  -

تاریخ انتشار 2016